# Multimodal Reasoning

**InternVL3-8B-Instruct-GGUF** (unsloth) · Apache-2.0 · Image-to-Text · Transformers · 2,412 downloads · 1 like
InternVL3-8B-Instruct is an advanced multimodal large language model (MLLM) that demonstrates exceptional overall performance, with strong multimodal perception and reasoning capabilities.

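Since several entries in this listing ship as GGUF quants, a minimal sketch of downloading and running one locally may help. It assumes `llama-cpp-python` and `huggingface_hub` are installed; the repo id is taken from the listing, the quant filename is hypothetical (check the repo's file list for actual names), and image input typically requires the repo's separate mmproj (vision projector) file, which this text-only smoke test omits.

```python
# Minimal sketch: download a GGUF quant and run a text-only prompt with
# llama-cpp-python. Filename below is hypothetical; see the repo's files.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="unsloth/InternVL3-8B-Instruct-GGUF",   # repo id from the listing
    filename="InternVL3-8B-Instruct-Q4_K_M.gguf",   # hypothetical quant file
)

llm = Llama(model_path=model_path, n_ctx=4096)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Describe what an MLLM is in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```
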
**InternVL3-14B-Instruct-GGUF** (unsloth) · Apache-2.0 · Image-to-Text · Transformers · 982 downloads · 1 like
InternVL3-14B-Instruct is an advanced multimodal large language model (MLLM) that demonstrates exceptional multimodal perception and reasoning capabilities, supporting tasks such as tool usage, GUI agents, industrial image analysis, and 3D visual perception.

**Bespoke-MiniChart-7B** (bespokelabs) · Image-to-Text · Safetensors · English · 437 downloads · 12 likes
A 7B-parameter open-source chart-understanding vision-language model developed by Bespoke Labs, outperforming closed-source models such as Gemini-1.5-Pro on chart question-answering tasks.

**Skywork-R1V2-38B** (Skywork) · MIT · Image-to-Text · Transformers · 1,778 downloads · 105 likes
Skywork-R1V2-38B is a state-of-the-art open-source multimodal reasoning model as of its release, demonstrating strong performance across multiple benchmarks with robust visual reasoning and text comprehension capabilities.

**ViCA2-Init** (nkkbr) · Apache-2.0 · Video-to-Text · Transformers · English · 30 downloads · 0 likes
ViCA2 is a multimodal vision-language model focused on video understanding and visual-spatial cognition tasks.

**ViCA2-Stage2-OneVision-FT** (nkkbr) · Apache-2.0 · Video-to-Text · Transformers · English · 63 downloads · 0 likes
ViCA2 is a 7B-parameter multimodal vision-language model focused on video understanding and visual-spatial cognition tasks.

**InternVL3-78B-hf** (OpenGVLab) · Other · Image-to-Text · Transformers · 40 downloads · 1 like
InternVL3 is an advanced multimodal large language model series with powerful multimodal perception and reasoning capabilities, supporting image, video, and text inputs.

**SpaceThinker-Qwen2.5VL-3B** (remyxai) · Apache-2.0 · Image-to-Text · English · 490 downloads · 7 likes
SpaceThinker is a multimodal vision-language model that enhances spatial reasoning through test-time computation, excelling particularly at quantitative spatial reasoning and object-relationship analysis.

**InternVL3-9B-AWQ** (OpenGVLab) · MIT · Image-to-Text · Transformers · 214 downloads · 1 like
InternVL3-9B is a multimodal large language model from the InternVL3 series, featuring exceptional multimodal perception and reasoning capabilities. It supports application scenarios such as tool usage, GUI agents, industrial image analysis, and 3D visual perception.

**InternVL3-8B-AWQ** (OpenGVLab) · Other · Image-to-Text · Transformers · 1,441 downloads · 3 likes
InternVL3-8B is an advanced multimodal large language model developed by OpenGVLab, featuring powerful multimodal perception and reasoning capabilities and supporting tool calling, GUI agents, industrial image analysis, 3D visual perception, and other emerging fields.

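For the AWQ checkpoints above, loading follows the usual Transformers pattern for InternVL-style repos, which ship custom modeling code. A minimal sketch, assuming `autoawq`, `accelerate`, and a recent `transformers` are installed; the repo id comes from the listing, and the actual multimodal chat helpers are defined by the repo's remote code, so consult the model card for a full inference example.

```python
# Minimal loading sketch for an AWQ-quantized InternVL3 checkpoint.
# InternVL repos ship custom modeling code, hence trust_remote_code=True.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "OpenGVLab/InternVL3-8B-AWQ"  # repo id taken from the listing above

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.float16,   # AWQ kernels run in fp16
    trust_remote_code=True,
    device_map="auto",           # requires accelerate
)
# Image preprocessing and the chat() helper are defined by the repo's
# remote code; see the model card for the full multimodal example.
```
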
**TBAC-VLR1-3B-preview** (TencentBAC) · Apache-2.0 · Image-to-Text · Safetensors · English · 328 downloads · 11 likes
A multimodal language model fine-tuned by the Tencent PCG Basic Algorithm Center from Qwen2.5-VL-3B-Instruct, achieving state-of-the-art performance among models of similar scale on multiple multimodal reasoning benchmarks.

**InternVL3-9B-Instruct** (OpenGVLab) · MIT · Image-to-Text · Transformers · 220 downloads · 2 likes
InternVL3-9B-Instruct is the supervised fine-tuned version of the InternVL3 series, featuring powerful multimodal perception and reasoning capabilities and supporting various modalities such as images, text, and videos.

**InternVL3-8B-Instruct** (OpenGVLab) · Other · Image-to-Text · Transformers · 885 downloads · 2 likes
InternVL3-8B-Instruct is an advanced multimodal large language model (MLLM) that demonstrates exceptional multimodal perception and reasoning capabilities, supporting functionality such as tool usage, GUI agents, industrial image analysis, and 3D visual perception.

**VL-Reasoner-7B** (TIGER-Lab) · Apache-2.0 · Image-to-Text · Transformers · English · 126 downloads · 1 like
VL-Reasoner-7B is a multimodal reasoning model trained with the GRPO-SSR technique, demonstrating strong performance across multiple multimodal reasoning benchmarks.

**General-Reasoner-14B-Preview** (TIGER-Lab) · Apache-2.0 · Large Language Model · Transformers · English · 33 downloads · 3 likes
A multimodal reasoning model built on the Qwen2.5-14B base model and trained on the VisualWebInstruct-Verified dataset, supporting English-language tasks.

**SpaceQwen2.5-VL-3B-Instruct-GGUF** (mradermacher) · Apache-2.0 · Image-to-Text · English · 282 downloads · 0 likes
SpaceQwen2.5-VL-3B-Instruct is a multimodal vision-language model focused on spatial reasoning and embodied AI tasks.

**R01-Gemma-3-1b-it** (EpistemeAI) · Image-to-Text · Transformers · English · 17 downloads · 1 like
Gemma 3 is a lightweight open multimodal model family from Google, built on the same technology as Gemini, supporting text and image inputs and generating text outputs.

**Cogito-v1** (cortexso) · Apache-2.0 · Large Language Model · 4,002 downloads · 2 likes
A powerful hybrid reasoning model from Deep Cogito, trained via Iterated Distillation and Amplification (IDA), excelling at programming, STEM, multilingual, and agentic applications.

**Space-Voice-Label-Detect-Beta** (devJy) · Apache-2.0 · Image-to-Text · Transformers · English · 38 downloads · 1 like
A fine-tuned version of the Qwen2.5-VL-3B model, trained 2x faster using Unsloth and Hugging Face's TRL library.

**Dreamer-7B** (osunlp) · Apache-2.0 · Image-to-Text · Transformers · English · 62 downloads · 3 likes
WebDreamer is a planning framework that enables efficient and effective planning for real-world web-agent tasks.

**3B-Curr-ReFT** (ZTE-AIM) · Apache-2.0 · Image-to-Text · 37 downloads · 3 likes
A multimodal large language model fine-tuned from Qwen2.5-VL using the Curr-ReFT method, significantly enhancing visual-language understanding and reasoning capabilities.

**STEVE-R1-7B-SFT-i1-GGUF** (mradermacher) · Apache-2.0 · Image-to-Text · English · 394 downloads · 0 likes
A weighted/imatrix quantized version of the Fanbin/STEVE-R1-7B-SFT model, suitable for resource-constrained environments.

**VideoMind-2B** (yeliudev) · BSD-3-Clause · Video-to-Text · 207 downloads · 1 like
VideoMind is a multimodal agent framework that enhances video reasoning by emulating human thought processes such as task decomposition, moment localization and verification, and answer synthesis.

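The roles the VideoMind description names (task decomposition, moment localization and verification, answer synthesis) amount to a simple control flow, sketched below with hypothetical function stand-ins rather than the project's actual API.

```python
# Illustrative sketch only: a planner -> grounder -> verifier -> answerer
# pipeline of the kind the VideoMind description outlines. All callables
# here are hypothetical stand-ins, not the project's real interface.

def answer_video_question(video, question, planner, grounder, verifier, answerer):
    # 1. Task decomposition: the planner decides which roles are needed.
    plan = planner(question)
    moment = None
    if "grounding" in plan:
        # 2. Moment localization: propose temporal segments relevant to the question.
        candidates = grounder(video, question)
        # 3. Verification: keep the candidate segment the verifier scores highest.
        moment = max(candidates, key=lambda seg: verifier(video, question, seg))
    # 4. Answer synthesis over the verified segment (or the full video).
    return answerer(video if moment is None else moment, question)
```
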
**Vintern-3B-R-beta** (5CD-AI) · MIT · Image-to-Text · Transformers · Multilingual · 1,841 downloads · 14 likes
Vintern-3B-R-beta is a multimodal large language model focused on complex image-based reasoning tasks, capable of decomposing reasoning into steps and effectively mitigating hallucinations.

**Llama-3.2-11B-Vision-Medical** (Varu96) · Apache-2.0 · Image-to-Text · Transformers · English · 25 downloads · 1 like
A model fine-tuned from unsloth/Llama-3.2-11B-Vision-Instruct, trained 2x faster using Unsloth and Hugging Face's TRL library.

**Sarashina2-Vision-14B** (sbintuitions) · MIT · Image-to-Text · Transformers · Multilingual · 192 downloads · 6 likes
Sarashina2-Vision-14B is a large Japanese vision-language model developed by SB Intuitions, combining Sarashina2-13B with the image encoder of Qwen2-VL-7B and achieving excellent performance on multiple benchmarks.

**Sarashina2-Vision-8B** (sbintuitions) · MIT · Image-to-Text · Transformers · Multilingual · 1,233 downloads · 4 likes
Sarashina2-Vision-8B is a large Japanese vision-language model trained by SB Intuitions, built from Sarashina2-7B and the image encoder of Qwen2-VL-7B, achieving excellent performance on multiple benchmarks.

**VisualThinker-R1-Zero** (turningpoint-ai) · MIT · Image-to-Text · Safetensors · English · 578 downloads · 6 likes
The first multimodal reasoning model to reproduce the 'aha moment' and increased response length using only a 2B model without supervised fine-tuning.

**Qwen2.5-VL-7B-Instruct-quantized.w8a8** (RedHatAI) · Apache-2.0 · Image-to-Text · Transformers · English · 1,992 downloads · 3 likes
A quantized version of Qwen2.5-VL-7B-Instruct supporting vision-text input and text output, optimized for inference efficiency via INT8 quantization of weights and activations (w8a8).

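Here w8a8 means both weights and activations are quantized to INT8, a format aimed at high-throughput serving. A minimal sketch with vLLM, which loads such compressed-tensors checkpoints natively; the repo id is assumed from the listing, and the prompt is a text-only smoke test (image inputs go through vLLM's multimodal inputs API; see the vLLM docs and the model card).

```python
# Minimal sketch: serving the INT8 w8a8 checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8",  # assumed repo id
    max_model_len=4096,
)
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Briefly explain INT8 w8a8 quantization."], params)
print(outputs[0].outputs[0].text)
```
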
**UI-TARS-2B-SFT** (bytedance-research) · Apache-2.0 · Image-to-Text · Transformers · Multilingual · 5,792 downloads · 19 likes
UI-TARS is a next-generation native graphical user interface (GUI) agent model designed to seamlessly interact with GUIs through human-like perception, reasoning, and action capabilities.

**QVQ-72B-Preview-AWQ** (kosbu) · Other · Image-to-Text · Transformers · English · 532 downloads · 8 likes
QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities. This repository provides its AWQ 4-bit quantized version.

**LlamaV-o1** (omkarthawakar) · Apache-2.0 · Image-to-Text · Safetensors · English · 1,406 downloads · 93 likes
LlamaV-o1 is an advanced multimodal large language model designed for complex visual reasoning tasks, optimized through curriculum learning and demonstrating strong performance across diverse benchmarks.

**VideoLISA-3.8B** (ZechenBai) · Apache-2.0 · Video Segmentation · Safetensors · English · 247 downloads · 6 likes
A language-guided video reasoning-segmentation model built on LLaVA-Phi-3-mini-4k-instruct, focusing on object segmentation tasks in videos.

**NVLM-D-72B** (nvidia) · Image-to-Text · Transformers · English · 14.33k downloads · 769 likes
NVLM 1.0 is a series of cutting-edge multimodal large language models that achieve state-of-the-art results on vision-language tasks, comparable to leading proprietary and open-access models.

**SL-Persian-SER-with-GWO-and-HuBERT** (amirahmadian16) · Apache-2.0 · Large Language Model · Transformers · 20 downloads · 0 likes
An open-source model released under the Apache-2.0 license; judging by its name, a Persian speech emotion recognition (SER) model combining HuBERT features with Grey Wolf Optimization (GWO). The listing provides no further details.

**Emotion-LLaMA** (ZebangCheng) · Apache-2.0 · Large Language Model · Transformers · 213 downloads · 4 likes
Released under the Apache-2.0 license; judging by its name, a LLaMA-based model for emotion understanding. The listing provides no further details.

**Shotluck Holmes 3.1** (RichardLuo) · Apache-2.0 · Large Language Model · Transformers · 21 downloads · 2 likes

**NLLB-Uzbek-Russian** (sarahai) · Apache-2.0 · Large Language Model · Transformers · 54 downloads · 1 like
An open-source model released under the Apache-2.0 license; judging by its name, an NLLB-based Uzbek-Russian translation model.

**Finetuned-NLI-Provenance** (GuardrailsAI) · Apache-2.0 · Large Language Model · Transformers · 360 downloads · 3 likes

**Gazelle v0.2** (tincans-ai) · Apache-2.0 · Audio-Text-to-Text · Transformers · English · 90 downloads · 99 likes
Gazelle v0.2 is a joint speech-language model released by Tincans, supporting English.